Session 9: Scraping Interactive Web Pages (part 2)
Introduction to Web Scraping and Data Management for Social Scientists
Johannes B. Gruber
2025-07-17
Browser automation
What is Browser Automation?
Definition: the process of using software to control web browsers and interact with web elements programmatically
Tools Involved: Common tools include Selenium, Puppeteer, and Playwright
These tools allow scripts to perform actions like clicking, typing, and navigating through web pages automatically
Common Uses of Browser Automation
Testing: Widely used in software development for automated testing of web applications to ensure they perform as expected across different environments and browsers
Task Automation: Simplifies repetitive tasks such as form submissions, account setups, or any routine that can be standardized across web interfaces
Browser Automation in Web Scraping
Dynamic Content Handling: Essential for scraping websites that load content dynamically with JavaScript. Automation tools can interact with the webpage, wait for content to load, and then scrape the data.
Simulation of User Interaction: Can mimic human browsing patterns to interact with elements (like dropdowns, sliders, etc.) that need to be manipulated to access data
Avoiding Detection: More sophisticated than basic scraping scripts, browser automation can help mimic human-like interactions, reducing the risk of being detected and blocked by anti-scraping technologies
Example: Google Maps
Goal
Check the commute time programmatically
Extract the distance and time it takes to make a journey
The new read_html_live from rvest solves this by controlling a real (headless) browser:
# loads a real web browser
sess <- read_html_live("https://www.google.de/maps/dir/Armadale+St,+Glasgow,+UK/Lilybank+House,+Glasgow,+UK/@55.8626667,-4.2712892,14z/data=!3m1!4b1!4m14!4m13!1m5!1m1!1s0x48884155c8eadf03:0x8f0f8905398fcf2!2m2!1d-4.2163615!2d55.8616765!1m5!1m1!1s0x488845cddf3cffdb:0x7648f9416130bcd5!2m2!1d-4.2904601!2d55.8740368!3e0?entry=ttu")
You can have a look at the browser with:
sess$view()
Unfortunately, we do not get the content yet. We first have to click on “Accept all”.
Let’s use browser automation
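With rvest’s live sessions, clicking is a method on the session object itself. A minimal sketch, assuming the session from above; the CSS selector for the consent button is a guess and should be checked in the live view (sess$view()):

# click the cookie-consent button; the selector is an assumption
sess$click("button[aria-label='Accept all']")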
After manipulating something about the session, you need to read it into R again:
# the session behaves like a normal rvest html object
trip <- sess |>
  html_elements("#section-directions-trip-0")
trip |>
  html_element("h1") |>
  html_text2()
Control Playwright from R with an experimental package
I did not write the package, but made some changes to make it easier to use.
To get started, we first initialize the underlying Python package and then launch Chromium:
library(reticulate)
library(playwrightr)
pw_init()
chrome <- browser_launch(
  browser = "chromium",
  headless = !interactive(),
  # make sure data like cookies are stored between sessions
  user_data_dir = "user_data_dir/"
)
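Before we can interact with anything, we need to open a tab and navigate to the target page. A sketch using the new_page() and goto() functions that also appear below; the URL is a placeholder for whatever page you want to scrape:

# open a new tab in the running browser and navigate to the target page
page <- new_page(chrome)
goto(page, "https://example.com")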
When you are in Europe, the page asks for consent to save cookies in your browser:
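To get past the dialog programmatically, something like the following should work. This is a hedged sketch: both the existence and signature of a click() helper in playwrightr and the CSS selector are assumptions to verify against the package documentation:

# dismiss the consent dialog; helper signature and selector are assumptions
click(page, "button[aria-label='Allow all cookies']")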
Getting more posts
This page loads new content when you scroll down. We can do this using the scroll function:
scroll(page)
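Since a single scroll only triggers loading of a handful of new posts, we can repeat it and pause in between. A sketch, assuming scroll() takes the page object as above:

# scroll several times, pausing so new posts can load
for (i in 1:5) {
  scroll(page)
  Sys.sleep(2) # give the page time to fetch new content
}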
Getting the page content
Okay, we now see the content. But what about collecting it? We can use several different get_* functions to identify specific elements. But we can also simply get the entire HTML content:
html <- get_content(page)
html
Conveniently, this is already an rvest object, so we can use our familiar tools to get the links of the visible posts. The page uses a role attribute, which I employ here, and I know that links to posts contain “posts”:
post_links <- html |>
  html_elements("[role=\"link\"]") |>
  html_attr("href") |>
  str_subset("posts")
head(post_links)
Collecting Post content
Now we can visit the page of one of these posts and collect the content from it:
post1 <- new_page(chrome)
# go to the page
goto(post1, post_links[1])
post1_html <- get_content(post1)
parse_path <- function(ix) {
  out <- as.list(ix$p)
  out[which(ix$p == as.character(ix$pos))] <- ix$pos[ix$p == as.character(ix$pos)]
  gsub("list(", "purrr::pluck(DATA, ", deparse1(out), fixed = TRUE)
}

#' Search a list
#'
#' @param l a list
#' @param f a function to identify the element you are searching
#'
#' @return an object containing the searched element with the function to extract it as a name
#' @export
list_search <- function(l, f) {
  paths <- rrapply::rrapply(
    object = l,
    condition = f,
    f = function(x, .xparents, .xname, .xpos) list(p = .xparents, n = .xname, pos = .xpos),
    how = "flatten"
  )
  out <- purrr::map(paths, function(p) purrr::pluck(l, !!!p$pos))
  names(out) <- purrr::map_chr(paths, parse_path)
  return(out)
}
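To illustrate what list_search() does, here is a toy example; the nested list is made up, and the rrapply and purrr packages need to be installed:

# a small nested list standing in for parsed post data
nested <- list(
  meta = list(id = 1L),
  content = list(
    post = list(text = "first post"),
    comments = list(list(text = "a comment"))
  )
)
# find all character elements; the names of the result are pluck() calls
# that extract each match from the original list
list_search(nested, function(x) is.character(x))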
I am not sure how scalable or stable this is. But it seems like we got the data for one post, at least. After getting the content you want (or not), we can close the page: